Term Deposit Predictor¶

by Jackson Lu, Daniel Yorke, Charlene Chin, and Mohammed Ibrahim, 2025/11/21

Summary¶

This project focuses on predicting whether clients will subscribe to a term deposit using the Bank Marketing dataset. A logistic regression model was developed, incorporating all available predictor variables after appropriate preprocessing. The model was evaluated using stratified 5-fold cross-validation, with an emphasis on the F1-score because it balances precision and recall. The analysis was conducted using Python and key libraries such as NumPy, pandas, and scikit-learn, with all code documented for reproducibility. Our final classifier performed fairly well, achieving a mean cross-validation accuracy of 0.844, F1-score of 0.551, and ROC-AUC of 0.910. This indicates that the model is reasonably effective at identifying clients who will subscribe to a term deposit, although there is room for improvement, particularly in recall. Further refinements could involve exploring additional features, tuning hyperparameters, or experimenting with alternative modeling techniques to enhance predictive performance.

Introduction¶

Financial institutions rely heavily on effective marketing strategies to identify which clients are most likely to subscribe to long-term financial products such as term deposits. These products support both customer financial planning and bank stability, yet subscription rates are often low due to ineffective targeting. Traditional marketing approaches depend heavily on human judgment, intuition, and repeated client contact, which can be costly, time-consuming, and inconsistent in effectiveness. As a result, developing more objective and data-driven methods for understanding and predicting client behaviour has become increasingly important.

In this project, we ask whether a machine learning algorithm can accurately predict whether a bank client will subscribe to a term deposit based on demographic attributes, financial information, and past marketing interactions. This question is important because traditional marketing strategies tend to rely on broad outreach rather than individualized prediction, leading to inefficiencies and potential client fatigue. Furthermore, understanding which client characteristics are associated with subscription behavior may support more personalized communication strategies and improve customer experience. If a machine learning classifier such as logistic regression can reliably predict subscription outcomes, it may enable more data-driven, scalable, and cost-effective marketing decisions, ultimately improving the performance of future campaigns.

Methods¶

Data¶

The dataset used in this project is the Bank Marketing dataset, created by Sérgio Moro, P. Cortez, and P. Rita in 2014 at the University of Minho in Portugal as part of a series of direct marketing campaigns conducted by a Portuguese banking institution. The data is publicly available through the UCI Machine Learning Repository and contains information on client demographics, financial status, and details related to previous marketing contacts.

The dataset contains 45,211 observations and 17 columns in total, comprising 16 predictor variables and 1 binary target variable (y) indicating whether the client subscribed to a term deposit. Each record represents a client who was contacted during a marketing campaign. The predictor variables capture a mix of demographic, financial, and campaign-related information. Among these, several features contain missing values (e.g., job, education, contact, and poutcome), requiring appropriate imputation or handling during preprocessing. Missing categorical values were imputed with a constant placeholder (“unknown”), and numerical features were standardized using StandardScaler to ensure comparability across variables. The target variable y is binary (yes or no), with only around 11–12% of the clients subscribing to a term deposit, resulting in a class imbalance that must be considered in model evaluation. Together, these attributes provide a rich and diverse feature set for assessing whether logistic regression can effectively capture the patterns associated with successful term-deposit subscriptions.
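The class imbalance noted above can be checked directly with pandas. Below is a minimal sketch using a small synthetic target with roughly the same 88/12 split as the real y column (the counts here are illustrative, not the dataset's exact values):

```python
import pandas as pd

# Toy target with roughly the same imbalance as the Bank Marketing data
y = pd.Series(["no"] * 88 + ["yes"] * 12)

# Fraction of each class; a heavy skew toward "no" means plain accuracy
# can look high even for a model that never predicts "yes"
proportions = y.value_counts(normalize=True)
print(proportions)
```

With roughly 88% "no" labels, a classifier that always predicts "no" would already reach ~0.88 accuracy, which is why the F1-score is a more informative headline metric here.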

This dataset is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.

Analysis¶

A logistic regression classifier was developed to model the probability that a client would subscribe to a term deposit (y). All predictor variables from the original dataset were included after appropriate preprocessing, which involved encoding categorical features with OneHotEncoder and scaling numerical features using StandardScaler. The dataset was randomly divided into a training set (80%) and a test set (20%) to enable unbiased performance evaluation.

Prior exploratory analysis examined the distributions of all input variables in the training set, with plots colored by the binary outcome (“yes” or “no”). Most numerical predictors—such as previous, pdays, campaign, duration, age, and balance—displayed substantial overlap between the two classes. However, some features, particularly duration, showed clear differences: clients who subscribed tended to have significantly longer call durations. This observation is consistent with findings from the original dataset documentation, confirming duration as a strong predictor of subscription. Other variables, such as campaign, previous, and pdays, were highly right-skewed with long tails, while categorical variables (e.g., job, marital status, education, and contact type) appeared to carry complementary contextual information about clients. These exploratory patterns were visualized in Figure 1, which displays feature distributions by subscription status. Figure 2 presents the correlation matrix among numerical predictors.

Correlation matrices (both Pearson and Spearman) were also examined to assess relationships among predictors. Overall, correlations between numerical features were weak, indicating low multicollinearity, which supports the use of logistic regression as an interpretable linear model. Some moderate associations were found among pdays, previous, and campaign, reflecting their shared connection to marketing contact history.
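The Pearson/Spearman comparison can be reproduced with pandas' `DataFrame.corr`. Below is a minimal example on a hypothetical slice of the contact-history features (values invented for illustration); Spearman's rank-based measure is less sensitive to the long right tails these variables exhibit:

```python
import pandas as pd

# Small synthetic frame mimicking the contact-history features
df_num = pd.DataFrame({
    "pdays":    [-1, -1, 90, 180, 365, -1],
    "previous": [0, 0, 1, 2, 5, 0],
    "campaign": [1, 2, 1, 3, 2, 4],
})

# Pearson captures linear association; Spearman compares ranks and is
# more robust to skew and outliers
pearson = df_num.corr(method="pearson")
spearman = df_num.corr(method="spearman")
print(pearson.round(2))
print(spearman.round(2))
```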

Model evaluation was conducted using stratified 5-fold cross-validation to address class imbalance. Performance was primarily assessed using the F1-score, which balances precision and recall, along with accuracy and ROC-AUC for a comprehensive evaluation. Across the five folds, the model achieved a mean accuracy of 0.844, a mean F1-score of 0.551, and a mean ROC-AUC of 0.910. Training and validation scores were closely aligned, indicating minimal overfitting. These results suggest that the logistic regression model provides strong discriminatory ability, though recall could be improved by further class rebalancing or feature engineering.

All analysis was conducted in Python (Van Rossum & Drake, 2009) using NumPy (Harris et al., 2020), pandas (McKinney, 2010), scikit-learn (Pedregosa et al., 2011), and Altair for visualization. All code for data processing, modeling, and figure generation is documented within this notebook for reproducibility.

Results and Discussion¶

The results demonstrate that logistic regression can effectively distinguish clients likely to subscribe to a term deposit, achieving strong performance across multiple evaluation metrics. The identification of duration as the most influential predictor aligns with expectations—longer calls typically indicate higher engagement and interest in the product. The moderate F1-score, however, reflects difficulty in recalling all positive cases, which was anticipated due to the dataset’s pronounced class imbalance (only around 11–12% subscribed).
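To make the precision/recall trade-off behind the F1-score concrete, here is a toy computation with scikit-learn's metrics on an imbalanced target (the labels are invented for illustration and do not come from this study):

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Toy imbalanced target: 10 positives out of 100
y_true = [1] * 10 + [0] * 90
# A model that finds 6 of the 10 positives but raises 4 false alarms
y_pred = [1] * 6 + [0] * 4 + [1] * 4 + [0] * 86

precision = precision_score(y_true, y_pred)  # 6 / (6 + 4) = 0.6
recall = recall_score(y_true, y_pred)        # 6 / 10 = 0.6
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two = 0.6
print(precision, recall, f1)
```

Because F1 is the harmonic mean, it stays low unless both precision and recall are reasonable, which is why it penalizes the missed subscribers that plain accuracy hides.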

These findings highlight the model’s practical potential: banks could apply such a model to prioritize high-probability clients, improving campaign efficiency while reducing unnecessary contact costs. The high ROC-AUC value (0.91) suggests that even a simple, interpretable model can meaningfully support decision-making in marketing strategy.

Future work could explore whether non-linear models (e.g., tree-based or ensemble methods) further improve recall, or whether feature engineering on time-related or interaction variables enhances predictive performance. In addition, investigating the relative influence of demographic versus campaign-related features could deepen understanding of what drives client subscription behavior.

In [1]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score, cross_validate
from sklearn.model_selection import StratifiedKFold
import pandas as pd
import numpy as np
import os
import warnings
warnings.filterwarnings('ignore', category=FutureWarning)
import altair as alt
import altair_ally as ally
from altair import datum

# Write chart data to JSON files so Altair can render the large dataset in Jupyter
alt.data_transformers.enable('json', prefix='../data/altair/')
Out[1]:
DataTransformerRegistry.enable('json')
In [2]:
from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
bank_marketing = fetch_ucirepo(id=222) 
  
# data (as pandas dataframes) 
X = bank_marketing.data.features 
y = bank_marketing.data.targets 
In [3]:
# Define the folder path
folder_path = '../data/'
altair_path = '../data/altair/'

# Ensure the directory exists (create it if it doesn't)
os.makedirs(folder_path, exist_ok=True)
os.makedirs(altair_path, exist_ok=True)

# Define file paths
features_file_path = os.path.join(folder_path, 'bank_marketing_features.csv')
targets_file_path = os.path.join(folder_path, 'bank_marketing_targets.csv')

# Export the DataFrames to CSV
X.to_csv(features_file_path, index=False) # index=False prevents pandas from writing row indices to the file
y.to_csv(targets_file_path, index=False)

df = pd.concat([X, y], axis=1)
In [4]:
# ignore warning messages from altair_ally
warnings.filterwarnings(
    "ignore",
    message="You passed a `<class 'narwhals.stable.v1.DataFrame'>` to `is_pandas_dataframe`.",
    category=UserWarning,
    module="altair.utils.data"
)
In [5]:
df.head()
Out[5]:
age job marital education default balance housing loan contact day_of_week month duration campaign pdays previous poutcome y
0 58 management married tertiary no 2143 yes no NaN 5 may 261 1 -1 0 NaN no
1 44 technician single secondary no 29 yes no NaN 5 may 151 1 -1 0 NaN no
2 33 entrepreneur married secondary no 2 yes yes NaN 5 may 76 1 -1 0 NaN no
3 47 blue-collar married NaN no 1506 yes no NaN 5 may 92 1 -1 0 NaN no
4 33 NaN single NaN no 1 no no NaN 5 may 198 1 -1 0 NaN no
In [6]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45211 entries, 0 to 45210
Data columns (total 17 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   age          45211 non-null  int64 
 1   job          44923 non-null  object
 2   marital      45211 non-null  object
 3   education    43354 non-null  object
 4   default      45211 non-null  object
 5   balance      45211 non-null  int64 
 6   housing      45211 non-null  object
 7   loan         45211 non-null  object
 8   contact      32191 non-null  object
 9   day_of_week  45211 non-null  int64 
 10  month        45211 non-null  object
 11  duration     45211 non-null  int64 
 12  campaign     45211 non-null  int64 
 13  pdays        45211 non-null  int64 
 14  previous     45211 non-null  int64 
 15  poutcome     8252 non-null   object
 16  y            45211 non-null  object
dtypes: int64(7), object(10)
memory usage: 5.9+ MB
In [7]:
df.describe()
Out[7]:
age balance day_of_week duration campaign pdays previous
count 45211.000000 45211.000000 45211.000000 45211.000000 45211.000000 45211.000000 45211.000000
mean 40.936210 1362.272058 15.806419 258.163080 2.763841 40.197828 0.580323
std 10.618762 3044.765829 8.322476 257.527812 3.098021 100.128746 2.303441
min 18.000000 -8019.000000 1.000000 0.000000 1.000000 -1.000000 0.000000
25% 33.000000 72.000000 8.000000 103.000000 1.000000 -1.000000 0.000000
50% 39.000000 448.000000 16.000000 180.000000 2.000000 -1.000000 0.000000
75% 48.000000 1428.000000 21.000000 319.000000 3.000000 -1.000000 0.000000
max 95.000000 102127.000000 31.000000 4918.000000 63.000000 871.000000 275.000000

Here we show the distributions of the different features, colored by the target y.

In [8]:
ally.alt.data_transformers.enable('vegafusion')
ally.dist(df, color='y')
Out[8]:

Figure 1. Key Feature Distributions

Here we show the correlations between the different features.

In [9]:
ally.corr(df)
Out[9]:

Figure 2. Feature Correlations

We created pipelines to transform the numerical and categorical features separately. The numerical features were standardized using StandardScaler, while the categorical features were encoded using OneHotEncoder. The final pipeline combined these preprocessing steps with the LogisticRegression model.

In [10]:
# Simple pipeline example
numeric_pipeline = make_pipeline(
    SimpleImputer(strategy='median'),
    StandardScaler()
)

categorical_pipeline = make_pipeline(
    SimpleImputer(strategy='constant', fill_value='unknown'),
    OneHotEncoder(drop='first')
)
In [11]:
# First, let's prepare the data
# Handle categorical variables in features
categorical_columns = X.select_dtypes(include=['object']).columns.tolist()
numerical_columns = X.select_dtypes(include=['int64', 'float64']).columns.tolist()

print(f"Categorical columns: {categorical_columns}")
print(f"Numerical columns: {numerical_columns}")
Categorical columns: ['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'poutcome']
Numerical columns: ['age', 'balance', 'day_of_week', 'duration', 'campaign', 'pdays', 'previous']
In [12]:
# Create preprocessing pipeline
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_pipeline, numerical_columns),
        ('cat', categorical_pipeline, categorical_columns)
    ])
In [13]:
full_pipeline = make_pipeline(
    preprocessor,
    LogisticRegression(random_state=522, max_iter=2000, class_weight="balanced")
)
In [14]:
# Prepare target variable
# LabelEncoder just creates a simple mapping - no statistics involved
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y.values.ravel())

# What it does:
# 'no'  → 0
# 'yes' → 1

print(f"Target classes: {label_encoder.classes_}")
# Note: ravel() converts the 2D DataFrame column (45211, 1) into a 1D array (45211,) so LabelEncoder can process it properly
Target classes: ['no' 'yes']
In [15]:
# Split the data
# 'stratify=y_encoded' ensures that your train and test sets have the same class distribution as your original dataset.
X_train, X_test, y_train, y_test = train_test_split(
    X, y_encoded, test_size=0.2, random_state=522, stratify=y_encoded
)

print(f"Training set size: {X_train.shape}")
print(f"Test set size: {X_test.shape}")
Training set size: (36168, 16)
Test set size: (9043, 16)
In [16]:
# Use stratified CV for imbalanced data
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=522)
cv_results = cross_validate(
    full_pipeline,
    X,
    y_encoded,
    cv=skf,  # ← Use stratified splits!
    scoring={'accuracy': 'accuracy', 'f1': 'f1', 'roc_auc': 'roc_auc'},
    return_train_score=True,
    n_jobs=-1
)
In [17]:
pd.DataFrame(cv_results).agg(['mean', 'std']).round(3).T
Out[17]:
mean std
fit_time 0.130 0.003
score_time 0.029 0.001
test_accuracy 0.844 0.004
train_accuracy 0.845 0.001
test_f1 0.551 0.008
train_f1 0.554 0.002
test_roc_auc 0.910 0.004
train_roc_auc 0.911 0.001

Table 1. Cross-validation performance metrics for logistic regression model

Our prediction model performed quite well in cross-validation, with a mean accuracy of 0.844 and a mean F1-score of 0.551. The mean ROC-AUC of 0.910 indicates that the model is effective at distinguishing between clients who will and will not subscribe to a term deposit. However, there is room for improvement in identifying all potential subscribers, as some were missed by the model.
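For completeness, scoring the held-out test split would follow the same pipeline shape. Below is a self-contained sketch on synthetic stand-in data (duration and contact are the only features, the values are generated, and handle_unknown='ignore' is used in the encoder for robustness on small splits), so the numbers it prints are illustrative rather than this study's results:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the bank data: one numeric and one categorical feature
rng = np.random.default_rng(522)
n = 400
duration = rng.exponential(250, n)
contact = pd.Series(rng.choice(["cellular", "telephone"], n), dtype="object")
contact[rng.random(n) < 0.3] = np.nan  # mimic missing contact values
X_demo = pd.DataFrame({"duration": duration, "contact": contact})

# Longer calls are more likely to be positives, echoing the real data
y_demo = pd.Series((duration + rng.normal(0, 100, n) > 300).astype(int))

# Same preprocessing shape as the notebook's pipeline
pre = ColumnTransformer([
    ("num", make_pipeline(SimpleImputer(strategy="median"), StandardScaler()),
     ["duration"]),
    ("cat", make_pipeline(SimpleImputer(strategy="constant", fill_value="unknown"),
                          OneHotEncoder(handle_unknown="ignore")),
     ["contact"]),
])
pipe = make_pipeline(pre, LogisticRegression(class_weight="balanced", max_iter=2000))

# Fit on the training split only, then score once on the held-out test split
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=522, stratify=y_demo
)
pipe.fit(X_tr, y_tr)
test_f1 = f1_score(y_te, pipe.predict(X_te))
test_auc = roc_auc_score(y_te, pipe.predict_proba(X_te)[:, 1])
print("test F1:", round(test_f1, 3), "test ROC-AUC:", round(test_auc, 3))
```

Keeping the test split untouched until this single final evaluation is what makes the reported test metrics an unbiased estimate of generalization.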

Data Validation¶

In [18]:
# imports 
import pandera.pandas as pa
In [19]:
df.head()
Out[19]:
age job marital education default balance housing loan contact day_of_week month duration campaign pdays previous poutcome y
0 58 management married tertiary no 2143 yes no NaN 5 may 261 1 -1 0 NaN no
1 44 technician single secondary no 29 yes no NaN 5 may 151 1 -1 0 NaN no
2 33 entrepreneur married secondary no 2 yes yes NaN 5 may 76 1 -1 0 NaN no
3 47 blue-collar married NaN no 1506 yes no NaN 5 may 92 1 -1 0 NaN no
4 33 NaN single NaN no 1 no no NaN 5 may 198 1 -1 0 NaN no
In [20]:
# defining the schema
schema = pa.DataFrameSchema({
    "age": pa.Column(int, pa.Check.between(15, 120)),
    
    "job": pa.Column(
        str,
        pa.Check.isin(["admin.","unknown","unemployed","management","housemaid","entrepreneur","student",
                                       "blue-collar","self-employed","retired","technician","services"])
    ),

    "marital": pa.Column(
        str,
        pa.Check.isin(["married", "single", "divorced"])

    ),

    "education": pa.Column(
        str, 
        pa.Check.isin(["unknown","secondary","primary","tertiary"])
        ),

    "default": pa.Column(
        str,
        pa.Check.isin(["yes", "no", "unknown"])
    ),

    "balance": pa.Column(
        int,
        checks=[
            pa.Check.ge(-5000),
            pa.Check.le(500000)
        ]
    ),

    "housing": pa.Column(
        str,
        pa.Check.isin(["yes", "no"])
    ),

    "loan": pa.Column(
        str,
        pa.Check.isin(["yes", "no"])
    ),

    "contact": pa.Column(
        str,
        pa.Check.isin(["cellular", "telephone", "unknown"])
    ),

    "day_of_week": pa.Column(
        int,
        pa.Check.isin([1, 2, 3, 4, 5, 6, 7])
    ),

    "month": pa.Column(
        str,
        pa.Check.isin(["jan", "feb", "mar", "apr", "may", "jun",
                       "jul", "aug", "sep", "oct", "nov", "dec"])
    ),

    "duration": pa.Column(
        int,
        checks=[
            pa.Check.ge(0),
            pa.Check.le(3600)
        ]
    ),

    "campaign": pa.Column(
        int,
        checks=[
            pa.Check.ge(1),
            pa.Check.le(300)
        ]
    ),

    "previous": pa.Column(
        int,
        checks=[
            pa.Check.ge(0),
            pa.Check.le(24)
        ]
    ),

    "poutcome": pa.Column(
        str,
        pa.Check.isin(["unknown", "other", "failure", "success"])
    ),

    "y": pa.Column(
        str,
        pa.Check.isin(["yes", "no"])
    ),    
})
In [21]:
# count number of failures per column
try:
    schema.validate(df, lazy=True)
except pa.errors.SchemaErrors as err:
    print(err.failure_cases["column"].value_counts())
column
poutcome       36959
day_of_week    35413
contact        13020
education       1857
job              288
previous          32
duration           3
balance            2
Name: count, dtype: int64
In [22]:
# unique errors per column
try:
    schema.validate(df, lazy=True)
except pa.errors.SchemaErrors as err:
    print(err.failure_cases.groupby("column")["failure_case"].nunique())
column
balance         2
contact         0
day_of_week    24
duration        3
education       0
job             0
poutcome        0
previous       16
Name: failure_case, dtype: int64
In [23]:
df["day_of_week"].unique()
Out[23]:
array([ 5,  6,  7,  8,  9, 12, 13, 14, 15, 16, 19, 20, 21, 23, 26, 27, 28,
       29, 30,  2,  3,  4, 11, 17, 18, 24, 25,  1, 10, 22, 31])
In [24]:
# uncomment below to see full error message
# validated_df = schema.validate(df, lazy = True)

From the above analysis and additional column by column analysis, we can see a few notable observations:

  • There were null values observed in the following columns: "job", "contact", "education", and "poutcome".
  • In the "balance" column, there were some outlier values (-8019 and -6847). These are unusual bank account balances, but probably do not need to be removed.
  • In the "day_of_week" column, it appears that we assumed the values could be from 1 to 7, but the values in fact go up to 31, which may indicate this should be encoded as "day_of_month" instead.
  • In the "duration" column, we observe some outliers (3881, 4918, and 3785) that indicate conversations longer than an hour. We may want to remove these from the analysis.
  • In the "previous" column, we observe outliers of more than 24 contacts per year (more than 2 contacts per month, on average). One extreme outlier, 275, may need to be removed.
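The outlier handling suggested above could be expressed as a simple boolean filter. The rows below are hypothetical examples built around the flagged extreme values, not a slice of the actual dataset:

```python
import pandas as pd

# Toy rows containing the kinds of extreme values flagged above
df_demo = pd.DataFrame({
    "duration": [261, 151, 4918, 76, 3881],
    "previous": [0, 2, 275, 1, 30],
})

# Keep calls of at most one hour and at most 24 prior contacts per year
mask = (df_demo["duration"] <= 3600) & (df_demo["previous"] <= 24)
df_clean = df_demo[mask]
print(df_clean)
```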
In [25]:
import pandera as pa
from pandera import DataFrameSchema, Check

# Function to check duplicates per column
def duplicate_columns_report(df: pd.DataFrame):
    report = []
    for col in df.columns:
        duplicated_count = df.duplicated(subset=[col]).sum()
        report.append({
            "column": col,
            "total_rows": len(df),
            "duplicated_observations": duplicated_count,
            "unique_observations": len(df) - duplicated_count,
            "has_duplicates": duplicated_count > 0,
            "percent_duplicates": duplicated_count / len(df) * 100
        })
    return pd.DataFrame(report)

# define a DataFrame-level Pandera check to enforce no duplicates for the whole DataFrame
def no_duplicates_df(df: pd.DataFrame):
    return df.duplicated().sum() == 0 

schema = DataFrameSchema(
    checks=[
        Check(
            no_duplicates_df,
            element_wise=False,
            error="DataFrame contains duplicate rows"
        )
    ]
)

# Validate the DataFrame (collect all failures)
try:
    schema.validate(df, lazy=True)
except pa.errors.SchemaErrors as e:
    print("DataFrame-level duplicate check failed")
    print(e.failure_cases)

# Generate a report for each column
column_duplicates_report = duplicate_columns_report(df)
print(column_duplicates_report)
         column  total_rows  duplicated_observations  unique_observations  \
0           age       45211                    45134                   77   
1           job       45211                    45199                   12   
2       marital       45211                    45208                    3   
3     education       45211                    45207                    4   
4       default       45211                    45209                    2   
5       balance       45211                    38043                 7168   
6       housing       45211                    45209                    2   
7          loan       45211                    45209                    2   
8       contact       45211                    45208                    3   
9   day_of_week       45211                    45180                   31   
10        month       45211                    45199                   12   
11     duration       45211                    43638                 1573   
12     campaign       45211                    45163                   48   
13        pdays       45211                    44652                  559   
14     previous       45211                    45170                   41   
15     poutcome       45211                    45207                    4   
16            y       45211                    45209                    2   

    has_duplicates  percent_duplicates  
0             True           99.829687  
1             True           99.973458  
2             True           99.993364  
3             True           99.991153  
4             True           99.995576  
5             True           84.145451  
6             True           99.995576  
7             True           99.995576  
8             True           99.993364  
9             True           99.931433  
10            True           99.973458  
11            True           96.520758  
12            True           99.893831  
13            True           98.763575  
14            True           99.909314  
15            True           99.991153  
16            True           99.995576  
In [26]:
from pandera import Column, DataFrameSchema, Check

# Define threshold
MISSING_THRESHOLD = 0.05  # 5%

# Custom check function for missingness
def missing_within_threshold(series, threshold=MISSING_THRESHOLD):
    # Returns True if fraction of missing values is within the threshold
    return series.isna().mean() <= threshold

# Create schema with a missingness check for all columns
schema_columns = {
    col: Column(
        nullable=True,
        checks=[
            Check(
                lambda s: missing_within_threshold(s),
                element_wise=False,
                error=f"Missing values exceed {MISSING_THRESHOLD*100}%"
            )
        ]
    )
    for col in df.columns
}

schema = DataFrameSchema(schema_columns)

# Validate in lazy mode to collect all errors
try:
    schema.validate(df, lazy=True)
except pa.errors.SchemaErrors as e:
    # e.failure_cases has all validation failures
    print("Some columns exceeded the missingness threshold")

report = pd.DataFrame({
    "total_missing": df.isna().sum(),
    "missing_fraction": df.isna().mean(),
    "exceeds_threshold": df.isna().mean() > MISSING_THRESHOLD
})

print(report)
             total_missing  missing_fraction  exceeds_threshold
age                      0          0.000000              False
job                    288          0.006370              False
marital                  0          0.000000              False
education             1857          0.041074              False
default                  0          0.000000              False
balance                  0          0.000000              False
housing                  0          0.000000              False
loan                     0          0.000000              False
contact              13020          0.287983               True
day_of_week              0          0.000000              False
month                    0          0.000000              False
duration                 0          0.000000              False
campaign                 0          0.000000              False
pdays                    0          0.000000              False
previous                 0          0.000000              False
poutcome             36959          0.817478               True
y                        0          0.000000              False
In [27]:
from pandera import DataFrameSchema, Column, Check

# Expected target distribution
expected_distribution = {"no": 0.5, "yes": 0.5}
tolerance = 0.05  # ±5% allowed

# Custom function for Pandera to check if the distribution is within tolerance
def check_target_distribution(series: pd.Series) -> bool:
    counts = series.value_counts(normalize=True)
    for cat, expected_prop in expected_distribution.items():
        observed_prop = counts.get(cat, 0)
        if abs(observed_prop - expected_prop) > tolerance:
            return False  # fail if any category is out of tolerance
    return True  # pass if all categories are within tolerance

# Create a schema for the target column
schema = DataFrameSchema({
    "y": Column(
        pa.String,
        checks=[
            Check(
                check_target_distribution,
                element_wise=False,
                error="Target distribution is outside the expected tolerance"
            )
        ],
        nullable=False
    )
})

try:
    schema.validate(df, lazy=True)
except pa.errors.SchemaErrors as e:
    print("Target distribution check failed:")
    print(e.failure_cases)

def target_distribution_report(series: pd.Series):
    counts = series.value_counts(normalize=True)
    report = []
    for cat, expected_prop in expected_distribution.items():
        obs_prop = counts.get(cat, 0)
        within_tol = abs(obs_prop - expected_prop) <= tolerance
        report.append({
            "category": cat,
            "observed": obs_prop,
            "expected": expected_prop,
            "within_tolerance": within_tol
        })
    return pd.DataFrame(report)

report = target_distribution_report(df["y"])
print(report)
Target distribution check failed:
  column  failure_case index schema_context  \
0      y         False  None         Column   

                                               check  check_number  
0  Target distribution is outside the expected to...             0  
  category  observed  expected  within_tolerance
0       no  0.883015       0.5             False
1      yes  0.116985       0.5             False
In [28]:
import numpy as np
from scipy.stats import chi2_contingency

# -------------------------
# CORRELATION FUNCTIONS
# -------------------------

def cramers_v(x, y):
    confusion = pd.crosstab(x, y)
    chi2 = chi2_contingency(confusion)[0]
    n = confusion.sum().sum()
    phi2 = chi2 / n
    r, k = confusion.shape
    return np.sqrt(phi2 / min(k - 1, r - 1))

def correlation_ratio(categories, values):
    categories = categories.astype(str)
    values = values.astype(float)
    means, counts = [], []
    overall_mean = np.mean(values)
    for cat in np.unique(categories):
        vals = values[categories == cat]
        means.append(np.mean(vals))
        counts.append(len(vals))
    # Convert to arrays so the weighted between-group sum uses NumPy arithmetic
    means, counts = np.asarray(means), np.asarray(counts)
    between = np.sum(counts * (means - overall_mean)**2)
    total = np.sum((values - overall_mean)**2)
    return np.sqrt(between / total) if total != 0 else 0

# -------------------------
# TARGET-FEATURE CORRELATION
# -------------------------

def target_correlation_report(df, target="y", threshold=0.8):
    correlations = []
    y = df[target]

    for col in df.columns:
        if col == target:
            continue
        x = df[col]
        if np.issubdtype(x.dtype, np.number) and np.issubdtype(y.dtype, np.number):
            corr = abs(x.corr(y))
        elif x.dtype == "object" and y.dtype == "object":
            corr = cramers_v(x, y)
        else:
            if np.issubdtype(x.dtype, np.number):
                corr = correlation_ratio(y.astype(str), x)
            else:
                corr = correlation_ratio(x.astype(str), y)
        correlations.append({"feature": col, "correlation": corr, "anomalous": corr > threshold})

    report = pd.DataFrame(correlations)
    anomalous = report[report["anomalous"]]
    return report, anomalous

# -------------------------
# FEATURE-FEATURE CORRELATION
# -------------------------

def feature_correlation_report(df, target="y", threshold=0.8):
    features = df.drop(columns=[target])
    cols = features.columns
    results = []

    for i, c1 in enumerate(cols):
        for c2 in cols[i+1:]:
            x, y_ = features[c1], features[c2]
            if np.issubdtype(x.dtype, np.number) and np.issubdtype(y_.dtype, np.number):
                corr = abs(x.corr(y_))
            elif x.dtype == "object" and y_.dtype == "object":
                corr = cramers_v(x, y_)
            else:
                if np.issubdtype(x.dtype, np.number):
                    corr = correlation_ratio(y_.astype(str), x)
                else:
                    corr = correlation_ratio(x.astype(str), y_)
            results.append({"feature_1": c1, "feature_2": c2, "correlation": corr, "anomalous": corr > threshold})

    report = pd.DataFrame(results)
    anomalous = report[report["anomalous"]]
    return report, anomalous

# -------------------------
# USAGE
# -------------------------

target_report, anomalous_target = target_correlation_report(df, target="y", threshold=0.8)
feature_report, anomalous_features = feature_correlation_report(df, target="y", threshold=0.8)

# -------------------------
# PRINT REPORTS
# -------------------------
print("\n--- TARGET vs FEATURES CORRELATION ---")
print(target_report)
if not anomalous_target.empty:
    print("\n--- ANOMALOUS TARGET CORRELATIONS ---")
    print(anomalous_target)

print("\n--- FEATURE vs FEATURE CORRELATION ---")
print(feature_report)
if not anomalous_features.empty:
    print("\n--- ANOMALOUS FEATURE-FEATURE CORRELATIONS ---")
    print(anomalous_features)
--- TARGET vs FEATURES CORRELATION ---
        feature  correlation  anomalous
0           age     0.025155      False
1           job     0.136429      False
2       marital     0.065926      False
3     education     0.073427      False
4       default     0.022160      False
5       balance     0.052838      False
6       housing     0.139103      False
7          loan     0.068091      False
8       contact     0.011945      False
9   day_of_week     0.028348      False
10        month     0.260237      False
11     duration     0.394521      False
12     campaign     0.073172      False
13        pdays     0.103621      False
14     previous     0.093236      False
15     poutcome     0.469914      False

--- FEATURE vs FEATURE CORRELATION ---
    feature_1  feature_2  correlation  anomalous
0         age        job     0.501129      False
1         age    marital     0.433431      False
2         age  education     0.215201      False
3         age    default     0.017879      False
4         age    balance     0.097783      False
..        ...        ...          ...        ...
115  campaign   previous     0.032855      False
116  campaign   poutcome     0.112405      False
117     pdays   previous     0.454820      False
118     pdays   poutcome     0.878962       True
119  previous   poutcome     0.539278      False

[120 rows x 4 columns]

--- ANOMALOUS FEATURE-FEATURE CORRELATIONS ---
    feature_1 feature_2  correlation  anomalous
118     pdays  poutcome     0.878962       True
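
The helpers `cramers_v` and `correlation_ratio` used by the report functions above are defined earlier in the notebook. For reference, a minimal sketch of standard implementations (the bias-uncorrected Cramér's V and the correlation ratio η, assuming `scipy` is available) looks like:

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x, y):
    """Association between two categorical series, in [0, 1]."""
    table = pd.crosstab(x, y)
    # correction=False: plain chi-squared, no Yates continuity correction
    chi2 = chi2_contingency(table, correction=False)[0]
    n = table.to_numpy().sum()
    r, k = table.shape
    return np.sqrt(chi2 / (n * (min(r, k) - 1)))

def correlation_ratio(categories, values):
    """Correlation ratio (eta) between a categorical and a numeric series."""
    values = np.asarray(values, dtype=float)
    grand_mean = values.mean()
    groups = pd.Series(values).groupby(np.asarray(categories))
    # between-group sum of squares over total sum of squares
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for _, g in groups)
    ss_total = ((values - grand_mean) ** 2).sum()
    return np.sqrt(ss_between / ss_total) if ss_total > 0 else 0.0
```

Both measures lie in [0, 1], so a single `threshold` can be applied uniformly across all three pair types in the report functions.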

The above analysis shows:

  • The duplication percentage is high, as expected; the lowest duplication is in the "balance" column.
  • Missing values appear in "job", "education", "contact", and "poutcome", but the missingness exceeds the threshold only in "contact" and "poutcome".
  • The response classes are imbalanced (roughly 0.88 "no" vs. 0.11 "yes"); we set the balance threshold to 50% with a tolerance of 5%.
  • With a threshold of 0.8, no correlation between the response variable and any input variable is flagged as anomalous; the only anomalous feature-feature correlation is between "pdays" and "poutcome" (0.879).
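
The class-balance check described in the bullets can be reproduced directly from the response column. A minimal sketch, where the 50% target and 5% tolerance mirror the thresholds stated above (the helper name `check_class_balance` is illustrative, not part of the original pipeline):

```python
import pandas as pd

def check_class_balance(y, target=0.50, tolerance=0.05):
    """Flag a binary response as imbalanced when the majority-class
    proportion deviates from `target` by more than `tolerance`."""
    proportions = y.value_counts(normalize=True)
    majority = proportions.max()
    return proportions, abs(majority - target) > tolerance

# With roughly 88% "no" / 12% "yes", the check flags an imbalance:
y = pd.Series(["no"] * 88 + ["yes"] * 12)
proportions, imbalanced = check_class_balance(y)
# imbalanced → True (|0.88 - 0.50| = 0.38 > 0.05)
```

This imbalance is also why the summary reports F1 and ROC-AUC alongside accuracy: with an 88/12 split, accuracy alone would reward a classifier that predicts "no" for everyone.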
